implemented L1 distance & fixed L2 in binary quantization #21
Conversation
I just realized the changes in the C implementations broke the qdrant Windows build; I'm trying to understand why cl.exe can't build them.
I had seen the `l1_avx` is not used warning, but it is being used in the benchmarks. I could of course remove the benchmarks and the implementation, but do you have a better idea?
I have realized that this L1 distance is currently not compatible with binary quantization. I mostly followed what had been done for L2, but I suspect L2 is also incompatible with it. Edit: On second thought, I think it may already be compatible, as the current code uses XOR when invert = true, and the truth tables match perfectly (it is the actual Hamming distance in the case of L1).
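The XOR observation above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code; it assumes vectors are packed into `u64` words after binary quantization, and `l1_binary` is an invented name:

```rust
// Hypothetical sketch: for vectors quantized to single bits, the L1
// (Manhattan) distance reduces to the Hamming distance, which can be
// computed as popcount(a XOR b) on the packed bit representation.
fn l1_binary(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    // 0b1011 vs 0b0010 differ in bits 0 and 3, so the L1 distance is 2.
    let a = 0b1011u64;
    let b = 0b0010u64;
    assert_eq!(l1_binary(a, b), 2);

    // Cross-check against the unpacked sum of per-bit |a_i - b_i|.
    let sum: u32 = (0..64).map(|i| (((a >> i) ^ (b >> i)) & 1) as u32).sum();
    assert_eq!(l1_binary(a, b), sum);
    println!("ok");
}
```

This is also why the same kernel can serve both L1 and L2 under binary quantization: on 0/1 components, |a_i - b_i| and (a_i - b_i)² are identical.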
I have fixed both L1 and L2 distances and added tests for both in case of binary quantization.
I have run a benchmark with real data using Euclidean distance + binary quantization (using qdrant itself). It did not work well. I need to investigate and understand the difference between the toy test data and the real data. I'll also check L1 tomorrow. Meanwhile, I have set the status back to draft.
Did a first review pass with some minor comments. Thank you very much for your time and effort to implement this! We'll definitely take a proper look at this.
I have been banging my head against this all day. I am pretty sure the L1 and L2 distances and their inverses (similarities) are calculated properly. I have gone through the unit tests, added integration tests that use binary and scalar quantization on qdrant, etc.; they all work as I expect. Nevertheless, when I run our benchmark image vector dataset against this implementation, I still get 0% accuracy with L1 and L2 using binary quantization, while dot product gets about 84%. I could let it go if I could justify it as "our problem is just not compatible with binary quantization", but the fact that dot product works annoys me to no end. I am open to suggestions.
OK, I had an epiphany. I have relaxed the L1 and L2 binary tests to check only the similarity order, not the actual score values. This way I could see that the previous dot-product-based implementation was still on point for both L1 and L2; it was just the inverse of what it was supposed to be. The many layers of inversion confused me too. I apologize to the reviewers for having to go through my learning experience with me. :) I will run the large benchmark with this latest version as soon as I can build a Docker image. Edit: aaaand we have a winner! 🍾
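An order-only test of the kind described above might look like this sketch. The data, the helper names (`exact_l1`, `rank_by`), and the use of a negated distance as a stand-in for a quantized similarity score are all illustrative assumptions, not the PR's actual tests:

```rust
// Exact L1 (Manhattan) distance between two float vectors.
fn exact_l1(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).sum()
}

// Return point indices sorted ascending by the given score function.
fn rank_by<F: Fn(&[f32]) -> f32>(points: &[Vec<f32>], score: F) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..points.len()).collect();
    idx.sort_by(|&i, &j| score(&points[i]).partial_cmp(&score(&points[j])).unwrap());
    idx
}

fn main() {
    let query = [0.0f32, 0.0];
    let points = vec![vec![5.0, 5.0], vec![1.0, 1.0], vec![3.0, 0.0]];

    // Ranking by exact L1 distance from the query: distances are 10, 2, 3.
    let dist_order = rank_by(&points, |p| exact_l1(&query, p));
    assert_eq!(dist_order, vec![1, 2, 0]);

    // A similarity is an order-reversing transform of the distance; an
    // order-only test asserts the rankings agree without comparing scores.
    let mut sim_order = rank_by(&points, |p| -exact_l1(&query, p));
    sim_order.reverse();
    assert_eq!(dist_order, sim_order);
    println!("order by L1: {:?}", dist_order);
}
```

Checking only the induced ordering makes the test robust to monotone rescaling of the quantized scores while still catching an accidentally inverted score, which is exactly the bug described above.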
Nice work!
I must admit that I didn't look into the C implementations. Maybe @IvanPleshkov can take a look.
It's great work! I'm really impressed, especially with the correct scalar quantization.
This pull request is related to qdrant/qdrant#3052.
It implements the L1 distance type, to be used by the matching Manhattan distance implementation in qdrant/qdrant.
I have tried to make the changes as non-disruptive as possible, but since L1 distance cannot be reduced to a dot product as easily as L2 can, some branching between the scoring implementations was unavoidable.
I have added tests and benchmarks and verified that the implementation works for the simple (scalar), AVX, SSE, and Neon code paths.
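For context on why L2 can lean on the dot product while L1 cannot: squared L2 decomposes as ||a - b||² = ||a||² + ||b||² - 2⟨a, b⟩, so an L2 scorer can reuse a dot-product kernel plus precomputed norms, whereas no analogous identity exists for a sum of absolute differences. A small sketch verifying the identity (illustrative only, not the PR's code):

```rust
// Plain dot product of two float vectors.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    let b = [4.0f32, 0.0, -1.0];

    // Direct squared L2: sum of (a_i - b_i)^2.
    let l2_sq_direct: f32 = a.iter().zip(&b).map(|(x, y)| (x - y) * (x - y)).sum();

    // Decomposed form: ||a||^2 + ||b||^2 - 2 * <a, b>.
    let l2_sq_via_dot = dot(&a, &a) + dot(&b, &b) - 2.0 * dot(&a, &b);

    assert!((l2_sq_direct - l2_sq_via_dot).abs() < 1e-5);
    println!("squared L2 = {}", l2_sq_direct);
}
```

Since ∑|a_i - b_i| has no such expansion into norms and an inner product, an L1 scorer needs its own accumulation loop, which is the branching mentioned above.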